Convolutional layer
[Figure: an m x n x d input, k x k x d filters, non-linearity, and pooling]


For a given network, the size of the feature maps depends on the size of the input.
Feature maps





Convolutional layers can be applied to input images of any size. 
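As a rough illustration of why the feature-map size follows the input size, the sketch below uses the standard output-size formula for a single convolutional (or pooling) layer. The 3x3 kernel with stride 2 is a hypothetical layer chosen for the example, not one taken from the slides.

```python
def conv_output_size(input_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size of a convolution (or pooling) output along one dimension."""
    return (input_size + 2 * padding - kernel) // stride + 1


# Hypothetical layer: 3x3 kernel, stride 2, no padding.
# The same layer accepts both inputs; only the output size changes.
for size in (224, 180):
    print(size, "->", conv_output_size(size, kernel=3, stride=2))
```

Nothing in the layer itself fixes the input size: the kernel slides over whatever it is given, and only the size of the output changes.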
Fully connected layers


Fully connected layers require a fixed size input. 


They cannot be applied to images of different sizes. 
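The fixed-size requirement can be seen in a minimal sketch of a fully connected layer as a matrix-vector product. The helper name `fully_connected` and the tiny 2x4 weight matrix are illustrative assumptions only.

```python
def fully_connected(weights, x):
    """Compute y = W x, with W stored as a list of rows."""
    expected = len(weights[0])
    if len(x) != expected:
        # The number of columns in W is fixed when the network is built.
        raise ValueError(f"expected input of length {expected}, got {len(x)}")
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


W = [[1, 0, 0, 0],
     [0, 1, 0, 0]]          # hypothetical layer: 4 inputs, 2 outputs

print(fully_connected(W, [1, 2, 3, 4]))       # matching input length works

try:
    fully_connected(W, [1, 2, 3, 4, 5, 6])    # a larger input does not fit W
except ValueError as err:
    print("error:", err)
```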
Size requirements
Pooling


> Pooling function: Generates an aggregated representation for a set of feature vectors $f_1, \dots, f_N$.
> Average pooling: $\frac{1}{N} \sum_{i=1}^{N} f_i$
> Max pooling: Element-wise maximum
> Second-order pooling: $\frac{1}{N} \sum_{i=1}^{N} f_i f_i^{\top}$


> The size of the pooling output does not depend on the number of features N. 
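The three pooling functions can be sketched in plain Python as follows. The 1/N normalisation in the second-order case is an assumption; the function names are illustrative.

```python
def average_pool(features):
    """Mean of N feature vectors, each of length d -> vector of length d."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def max_pool(features):
    """Element-wise maximum over N feature vectors -> vector of length d."""
    return [max(col) for col in zip(*features)]


def second_order_pool(features):
    """Average of the outer products f f^T -> d x d matrix (1/N scaling assumed)."""
    n, d = len(features), len(features[0])
    return [[sum(f[a] * f[b] for f in features) / n for b in range(d)]
            for a in range(d)]


two = [[1.0, 2.0], [3.0, 6.0]]
three = two + [[0.0, 0.0]]
# The output length is 2 in both cases, regardless of N.
print(len(average_pool(two)), len(average_pool(three)))
print(max_pool(two), second_order_pool(two))
```

In every case the output shape is determined by the feature dimension d alone, which is what makes pooling useful for inputs of varying size.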
Spatial pyramid pooling


> Introduced in [Lazebnik 2006]. 


> Three steps:

  > Extract local feature descriptors at each pixel.
  > Divide the image into cells of different sizes.
  > Apply a pooling function to each cell and concatenate all the pooling outputs.
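The three steps can be sketched as below for a grid of per-pixel descriptors. The pyramid levels (1x1, 2x2, 4x4), the use of max pooling, and the way cell boundaries are rounded are all assumptions made for illustration; `make_map` only builds toy data.

```python
import math


def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """fmap: H x W x D nested lists. Returns a flat list of length D * sum(n * n)."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for n in levels:                      # one n x n grid of cells per level
        for i in range(n):
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                cell = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
                out.extend(max(col) for col in zip(*cell))   # max-pool the cell
    return out


def make_map(h, w, d):
    """Toy descriptor grid with a different value at every position."""
    return [[[float(r * c + k) for k in range(d)] for c in range(w)] for r in range(h)]


# Two grids of different spatial size give vectors of the same length.
print(len(spatial_pyramid_pool(make_map(13, 13, 3))))
print(len(spatial_pyramid_pool(make_map(10, 7, 3))))
```

Because the number of cells is fixed by the pyramid levels rather than by the grid size, the concatenated vector always has the same length.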





S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories", CVPR 2006.





Spatial Pyramid Pooling in CNNs 


[Figure: spatial pyramid pooling turns the convolutional feature maps into a fixed-length representation, which is passed to the fully-connected layers (fc6, fc7)]
Experiments on ImageNet 2012


> Experimented with three different CNN architectures. 
> ZF-5 ([Zeiler 2013], 5 convolutional layers) 
> Convnet*-5 ([Krizhevsky 2012], 5 convolutional layers)
> Overfeat-5/7 ([Sermanet 2013], 5/7 convolutional layers) 


[Figure: Input -> Crop or resize to a fixed size -> Multiple 224 x 224 images -> Convolutional layers -> Fully-connected layers -> Output]
Spatial pyramid pooling improves accuracy


[Figure: Input -> Crop or resize to a fixed size -> Multiple 224 x 224 images -> Convolutional layers -> Spatial pyramid pooling -> Fully-connected layers -> Output]


top-1 error (%)
              ZF-5          Convnet*-5    Overfeat-5    Overfeat-7
(a) no SPP    35.99         34.93         34.13         32.01
(b) SPP       34.98 (1.01)  34.38 (0.55)  32.87 (1.26)  30.36 (1.65)

top-5 error (%)
              ZF-5          Convnet*-5    Overfeat-5    Overfeat-7
(a) no SPP    14.76         13.92         13.52         11.97
(b) SPP       14.14 (0.62)  13.54 (0.38)  12.80 (0.72)  11.12 (0.85)

Values in parentheses are the improvement over (a).
Multiscale training improves accuracy


> Training with two sizes (224x224, 180x180), testing with 224x224 images. 


[Figure: Input -> Crop or resize to a fixed size -> Multiple 224 x 224 images, plus multiple 180 x 180 images (only while training) -> Convolutional layers -> Spatial pyramid pooling -> Fully-connected layers -> Output]

top-1 error (%)
                              ZF-5          Convnet*-5    Overfeat-5    Overfeat-7
(a) no SPP                    35.99         34.93         34.13         32.01
(b) SPP single-size trained   34.98 (1.01)  34.38 (0.55)  32.87 (1.26)  30.36 (1.65)
(c) SPP multi-size trained    34.60 (1.39)  33.94 (0.99)  32.26 (1.87)  29.68 (2.33)

top-5 error (%)
                              ZF-5          Convnet*-5    Overfeat-5    Overfeat-7
(a) no SPP                    14.76         13.92         13.52         11.97
(b) SPP single-size trained   14.14 (0.62)  13.54 (0.38)  12.80 (0.72)  11.12 (0.85)
(c) SPP multi-size trained    13.64 (1.12)  13.33 (0.59)  12.33 (1.19)  10.95 (1.02)

Values in parentheses are the improvement over (a).


Reducing computation time 


[Figure: instead of passing multiple 224 x 224 images through the network, SPP is applied to multiple regions of the feature maps, giving multiple pooled outputs]


Much faster than applying convolutional layers to multiple images. 
Multiscale network results


> Resized each image to six different scales. 


> Applied the CNN with SPP to each of the six resized images.
> For each scale, SPP was applied to multiple regions in the final feature maps. 


> A total of 98 different outputs were obtained from each image. 
> Final result was based on the average of 98 outputs. 
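A minimal sketch of the final averaging step is shown below, assuming each output is a vector of per-class scores. The function names and the three toy score vectors are hypothetical and stand in for the 98 outputs.

```python
def average_scores(view_scores):
    """Average a list of per-class score vectors, one vector per view."""
    n = len(view_scores)
    return [sum(col) / n for col in zip(*view_scores)]


def predict(view_scores):
    """Index of the class with the highest averaged score."""
    avg = average_scores(view_scores)
    return max(range(len(avg)), key=avg.__getitem__)


# Three toy views over two classes (the slides use 98 views per image).
views = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
print(average_scores(views), predict(views))
```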





method                           test scales  test views  top-1 val  top-5 val  top-5 test
Krizhevsky et al. [3]                         10          40.7       18.2
Overfeat (fast) [5]              1                        39.01      16.97
Overfeat (fast) [5]                                       38.12      16.27
Overfeat (big) [5]               4                        35.74      14.18
Howard (base) [32]               3            162         37.0       15.8
Howard (high-res) [32]           3            162         36.5       16.2
Zeiler & Fergus (ZF) (fast) [4]  1            10          38.4       16.5
Zeiler & Fergus (ZF) (big) [4]                10          37.5       16.0
Chatfield et al. [6]                                      -          13.1
ours                                          10          29.68      10.95
ours                                          96+2full    27.86      9.14       9.08




















Detection on PASCAL VOC 2007 using R-CNN


> Generate 2000 object proposals using selective search. 
> Resize each region into a pre-defined size (227x227). 
> Extract features from each region using a deep CNN. 
> Classify these features using SVM detectors. 

> Runs the CNN 2000 times per image, once for each proposal.


R-CNN: Regions with CNN features
[Figure: 1. Input image -> 2. Extract region proposals (~2k) -> 3. Compute CNN features -> 4. Classify regions (e.g. "tvmonitor?")]
Detection using CNN+SPP


> Run convolutional layers on the entire image only once. 
> Generate 2000 object proposals using selective search. 


> Map each object proposal region in the input image to the corresponding 
region in the output of final convolutional layer. 


> Use SPP to extract features from the final convolutional layer for each object 
proposal. 


> Classify these features using SVM detectors. 
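The mapping from an image-space proposal to a region of the final feature map might look like the sketch below. It assumes the convolutional layers have a single effective stride (16 here, a hypothetical value) and uses simple floor/ceil rounding; the exact rounding used in the original work is not given in the slides.

```python
import math


def map_box_to_feature_map(box, stride=16):
    """Map (x0, y0, x1, y1) in image pixels to feature-map cells.

    Returns (c0, r0, c1, r1) with the end indices exclusive.
    The stride and the rounding rule are assumptions for illustration.
    """
    x0, y0, x1, y1 = box
    c0, r0 = x0 // stride, y0 // stride
    c1, r1 = math.ceil(x1 / stride), math.ceil(y1 / stride)
    # Keep at least one cell so that pooling always has something to pool.
    return c0, r0, max(c1, c0 + 1), max(r1, r0 + 1)


print(map_box_to_feature_map((32, 48, 200, 120)))
```

Each mapped region can then be pooled with SPP, so the convolutional layers run once per image rather than once per proposal.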


                       CNN+SPP   R-CNN
mAP                    58.0      58.5
conv time (GPU)        0.053s    8.96s
fc time (GPU)          0.089s    0.07s
total time (GPU)       0.142s    9.03s
speedup (vs. R-CNN)    64x       -
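The speedup figure can be checked from the two timing columns above (0.142s total versus 9.03s total). The short sketch below only redoes that arithmetic; the variable names are illustrative.

```python
def total_time(conv_seconds, fc_seconds):
    """Total per-image time as the sum of the conv and fc parts."""
    return conv_seconds + fc_seconds


spp_total = total_time(0.053, 0.089)    # conv layers run once per image
rcnn_total = total_time(8.96, 0.07)     # conv layers run once per proposal
speedup = rcnn_total / spp_total

print(f"{spp_total:.3f}s vs {rcnn_total:.2f}s, ratio {speedup:.1f}")
```

The ratio comes out at a little under 64, which matches the rounded value in the table. Nearly all of the saving comes from the convolutional time, since the fully connected time is about the same for both methods.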





Detection results
Thank You





